feat(macOS): add vfkit backend for ephemeral and persistent VMs by tnk4on · Pull Request #259 · bootc-dev/bcvk

tnk4on · 2026-05-08T03:35:52Z

macOS has no KVM/QEMU, so this adds vfkit as the VM backend. Unlike the Linux path, which uses podman containers for isolation, the macOS path launches vfkit directly with per-VM resource separation.

Ephemeral VMs use a fully diskless architecture: a custom nbdkit EROFS plugin (crates/nbdkit-erofs-plugin) dynamically generates EROFS rootfs, FAT32 ESP, and GPT partition table from the container overlay directory, served via NBD. No disk images, shell scripts, or mkfs commands are needed. SSH keys are injected via initramfs CPIO append. Plugin distribution method is TBD.

Persistent VMs use EFI boot with disk images (EFI firmware is provided by vfkit via Apple Virtualization.framework, no external firmware files needed). The vfkit/ module mirrors the libvirt/ directory structure and provides the same subcommands: run, list, ssh, stop, start, rm, rm-all, inspect. Disk images with podman/buildah xattrs (security.selinux) are automatically cleaned before launch since Apple Virtualization.framework rejects them.

The only runtime dependency is Podman — the macOS PKG installer bundles vfkit and gvproxy, so no additional installation is needed. Homebrew is also supported.
Build and run:

cargo build --release
codesign -fs - target/release/bcvk

No entitlements needed — bcvk launches vfkit as a subprocess.

Tested manually on macOS (Apple Silicon) with rootful and rootless podman machine.

Fixes: #21

Assisted-by: Claude Code (Claude Opus 4.6)

gemini-code-assist

Code Review

This pull request introduces macOS support for managing ephemeral and persistent VMs using the vfkit backend and gvproxy for networking. It includes logic for extracting kernels from bootc containers, creating SquashFS root filesystems, and managing VM lifecycles through new CLI subcommands. Feedback highlights security concerns regarding potential command injection during SSH key setup and a TOCTOU race condition in port allocation. Additionally, the use of hardcoded global paths in /private/tmp was flagged as problematic for multi-user environments, and improvements were suggested for handling I/O results when communicating with gvproxy.

gemini-code-assist · 2026-05-08T03:37:59Z

+        std::path::PathBuf::from("/private/tmp/bcvk/vms")
+    }
+


Using a hardcoded global path in /private/tmp/bcvk for VM metadata and sockets is problematic on multi-user systems. It can lead to permission conflicts and security risks if multiple users attempt to run the tool simultaneously. Since podman machine on macOS typically shares the user's home directory by default, consider using a user-specific path like ~/.cache/bcvk/vms or ensuring the directory in /private/tmp is user-private (e.g., by including the UID in the name and setting 0700 permissions).
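A minimal sketch of the user-private directory suggestion, using only the Rust standard library. The helper name `ensure_private_vm_dir` and the idea of passing the base directory in (for example the user's cache directory) are assumptions for illustration, not code from this PR.

```rust
use std::fs::{self, DirBuilder, Permissions};
use std::io;
use std::os::unix::fs::{DirBuilderExt, PermissionsExt};
use std::path::{Path, PathBuf};

/// Hypothetical helper: create `<base>/bcvk/vms` so that only the owning
/// user can read, write, or traverse it.
pub fn ensure_private_vm_dir(base: &Path) -> io::Result<PathBuf> {
    let dir = base.join("bcvk").join("vms");
    // Request mode 0700 for every directory created here (subject to umask).
    DirBuilder::new().recursive(true).mode(0o700).create(&dir)?;
    // If the leaf directory already existed with looser bits, tighten it.
    fs::set_permissions(&dir, Permissions::from_mode(0o700))?;
    Ok(dir)
}
```

A caller could pass something like `$HOME/.cache` as `base`. If the directory has to stay under /private/tmp so that podman machine can see it, the same helper could be given a base that includes the UID in its name.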

gemini-code-assist · 2026-05-08T03:37:59Z

+        "#!/bin/bash\n\
+         mkdir -p /sysroot/var/roothome/.ssh\n\
+         chmod 700 /sysroot/var/roothome/.ssh\n\
+         echo '{}' > /sysroot/var/roothome/.ssh/authorized_keys\n\
+         chmod 600 /sysroot/var/roothome/.ssh/authorized_keys\n\
+         chown -R 0:0 /sysroot/var/roothome/.ssh\n",
+        pubkey
+    );


The SSH public key is inserted into a shell script using single quotes. While SSH public keys usually do not contain single quotes, a corrupted or maliciously crafted key could lead to command injection within the initramfs environment. A safer approach would be to write the key directly to a file in the CPIO archive and have the script reference that file, or use a heredoc with a quoted delimiter (e.g., cat <<'EOF').
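The first alternative could look roughly like the sketch below: the key text is validated and shipped as its own file in the CPIO archive, and the script only copies that file. The path `/etc/bcvk/authorized_keys`, the function names, and the availability of `cp` in the initramfs are assumptions made for illustration.

```rust
/// Hypothetical location of the key file inside the initramfs CPIO archive.
pub const KEY_FILE_IN_INITRAMFS: &str = "/etc/bcvk/authorized_keys";

/// Reject empty keys and keys containing control characters (including
/// embedded newlines, which would smuggle in extra authorized_keys lines).
pub fn validate_pubkey(pubkey: &str) -> Result<&str, String> {
    let key = pubkey.trim();
    if key.is_empty() {
        return Err("empty SSH public key".to_string());
    }
    if let Some(bad) = key.chars().find(|c| c.is_control()) {
        return Err(format!("control character {bad:?} in SSH public key"));
    }
    Ok(key)
}

/// Build the setup script without embedding any key text in shell source.
pub fn render_ssh_setup_script() -> String {
    let ssh_dir = "/sysroot/var/roothome/.ssh";
    let lines = [
        "#!/bin/bash".to_string(),
        format!("mkdir -p {ssh_dir}"),
        format!("chmod 700 {ssh_dir}"),
        // Copy the key file that was appended to the CPIO archive.
        format!("cp {KEY_FILE_IN_INITRAMFS} {ssh_dir}/authorized_keys"),
        format!("chmod 600 {ssh_dir}/authorized_keys"),
        format!("chown -R 0:0 {ssh_dir}"),
    ];
    lines.join("\n") + "\n"
}
```

Because the script text no longer depends on the key, quoting stops being a concern; the validation step only guards the contents of the key file itself.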

gemini-code-assist · 2026-05-08T03:37:59Z

+    let mut response = vec![0u8; 1024];
+    let _ = std::io::Read::read(&mut stream, &mut response);
+    let response_str = String::from_utf8_lossy(&response);


Ignoring the result of the read operation is brittle. It does not account for partial reads or I/O errors. This could lead to incorrect status checks if the response is not fully read in the first chunk or if the connection is closed prematurely.

Suggested change

-    let mut response = vec![0u8; 1024];
-    let _ = std::io::Read::read(&mut stream, &mut response);
-    let response_str = String::from_utf8_lossy(&response);
+    let mut response = vec![0u8; 1024];
+    let n = std::io::Read::read(&mut stream, &mut response).context("reading gvproxy response")?;
+    let response_str = String::from_utf8_lossy(&response[..n]);
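The suggested change surfaces I/O errors but still performs a single read. A hedged sketch of a loop that also tolerates partial reads follows; it assumes the gvproxy reply is an HTTP-style response whose headers end with a blank line, and the function name `read_response` and the `limit` parameter are hypothetical.

```rust
use std::io::{self, Read};

/// Read until EOF, until the header terminator `\r\n\r\n` has been seen,
/// or until `limit` bytes have been collected, whichever comes first.
pub fn read_response<R: Read>(stream: &mut R, limit: usize) -> io::Result<String> {
    let mut buf = Vec::new();
    let mut chunk = [0u8; 1024];
    while buf.len() < limit {
        // Never read past the caller's limit.
        let want = chunk.len().min(limit - buf.len());
        let n = match stream.read(&mut chunk[..want]) {
            Ok(0) => break, // peer closed the connection
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        buf.extend_from_slice(&chunk[..n]);
        // Assumption: headers are all the status check needs.
        if buf.windows(4).any(|w| w == b"\r\n\r\n") {
            break;
        }
    }
    Ok(String::from_utf8_lossy(&buf).into_owned())
}
```

Being generic over `Read` keeps the helper testable without a real socket; the Unix stream used for gvproxy would be passed as `&mut stream`.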

gemini-code-assist · 2026-05-08T03:37:59Z

+pub fn find_available_ssh_port() -> u16 {
+    use rand::Rng;
+    let mut rng = rand::rng();
+    const PORT_RANGE_START: u16 = 2222;
+    const PORT_RANGE_END: u16 = 3000;
+    for _ in 0..100 {
+        let port = rng.random_range(PORT_RANGE_START..PORT_RANGE_END);
+        if std::net::TcpListener::bind(("127.0.0.1", port)).is_ok() {
+            return port;
+        }
+    }
+    for port in PORT_RANGE_START..PORT_RANGE_END {
+        if std::net::TcpListener::bind(("127.0.0.1", port)).is_ok() {
+            return port;
+        }
+    }
+    PORT_RANGE_START
+}


The find_available_ssh_port function has a Time-of-Check to Time-of-Use (TOCTOU) race condition. The port is checked for availability by binding and then immediately closing it, but it could be taken by another process before gvproxy actually attempts to use it. While the caller has retry logic, this approach can lead to intermittent failures in busy environments.
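One common way to narrow the window is to let the kernel pick the port and hand the still-open listener back to the caller, who drops it only immediately before starting the process that needs the port. The sketch below assumes any free loopback port is acceptable (the PR restricts itself to 2222..3000), and `reserve_loopback_port` is a hypothetical name. Since gvproxy has to bind the port itself, this reduces the race rather than removing it, so the caller's retry logic is still needed.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener};

/// Ask the kernel for a free loopback port and keep the listener open so
/// that no other process can take the port while the caller still holds it.
pub fn reserve_loopback_port() -> io::Result<(TcpListener, u16)> {
    // Port 0 means "choose any free port".
    let listener = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    let port = listener.local_addr()?.port();
    Ok((listener, port))
}
```

Returning an `io::Result` also avoids silently falling back to a fixed port that may already be in use.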

cgwalters

Thanks so much for starting this!

I only skimmed so far

cgwalters · 2026-05-08T14:30:30Z

@@ -0,0 +1,136 @@
+//! Cross-platform SSH option types shared between Linux and macOS backends.
+//!
+//! Extracted from ssh.rs to avoid pulling in Linux-only dependencies on macOS.


Can you do a "prep" PR which refactors out common code?

Sure, will do.

cgwalters · 2026-05-08T14:31:15Z

+                if let Err(e) = Command::new("kill")
+                    .args([&vm.gvproxy_pid.to_string()])


Surely we can just use rustix::process::kill_process. Please look for other things like this.

cgwalters · 2026-05-08T14:31:39Z

+        print!("Remove all ephemeral VMs? [y/N]: ");
+        std::io::stdout().flush()?;
+        let mut input = String::new();
+        std::io::stdin().read_line(&mut input)?;
+        let input = input.trim().to_lowercase();
+        if input != "y" && input != "yes" {


Hmm, this may not be a new thing, but let's try to use, say, dialoguer or so.

Agreed. Linux has the same pattern too. How would you like to handle this — separate follow-up?

cgwalters · 2026-05-08T14:32:38Z

+
+/// Options for launching an ephemeral VM via vfkit.
+#[derive(clap::Parser, Debug)]
+pub struct RunEphemeralOpts {


Also, ideally share a clap #[flatten] struct with Linux.

Makes sense. There's a good overlap (memory, vcpus, debug, execute, ssh_keygen) but macOS also needs name, kernel_args, gui, and detach which don't exist on Linux. What would be the best way to split it?

cgwalters · 2026-05-08T19:17:36Z

+//!
+//! Boot flow:
+//! 1. Extract kernel + initramfs from container image
+//! 2. Create SquashFS rootfs (lz4, cached by digest)


The thing is that's O(data) to create whereas to me a key bit of ephemeral today is that it's "cheap" to launch.

Also, we've invested in EROFS for composefs as opposed to squashfs.

I'm not fundamentally opposed to making lookaside disk images (as apple/container does too) in the short term BUT I think in the medium term we really need something efficient.

This also relates to #213 - basically one model here might be where we make a composefs upper and the object store gets backed by remote access to the podman-machine store?

Thanks for the review. Based on your feedback, this PR has been reworked from the initial SquashFS implementation.

Adopted a fully diskless architecture — no disk images, shell scripts, or mkfs commands are generated at any point.

Chose NBD as the transport protocol, using Apple Virtualization.framework's VZNetworkBlockDeviceStorageDeviceAttachment for EFI boot.

Serve NBD via podman run -p inside podman machine, reusing gvproxy's TCP port forwarding for NBD traffic between host and VM.

Built a custom nbdkit EROFS plugin from scratch (crates/nbdkit-erofs-plugin) that dynamically generates EROFS rootfs, FAT32 ESP, and GPT partition table from the container overlay directory using the regions pattern.

This approach could also be applied to a Windows/Hyper-V backend.

One thing still TBD is the plugin distribution method. For local testing, I've been manually placing the .so inside podman machine. A few options I'm considering:

Bundle in the bcvk RPM. Could be included at podman machine image build time, then bind-mounted into the nbdkit container with -v.

Ship a dedicated container image with the plugin pre-installed. Adds image maintenance overhead.

Upstream the plugin to nbdkit. Probably too bcvk-specific to be a good fit.

There may be other approaches too. Any thoughts on the best way to handle this?

OK, NBD seems like it will work for now.

> reusing gvproxy's TCP port forwarding for NBD traffic between host and VM.

Ideally though we don't involve IP networking for this. I think we could have the VM connect to a unix domain socket instead?

> Built a custom nbdkit EROFS plugin from scratch (crates/nbdkit-erofs-plugin) that dynamically generates EROFS rootfs, FAT32 ESP, and GPT partition table

Hmm, but for ephemeral we don't need a GPT partition table (or ESP), we just need any mountable filesystem. We should be doing a direct kernel boot.

> from the container overlay directory using the regions pattern.

Ah...interesting. Hmm, I have questions about that but I guess I can toss my own LLM at the code to ask

> One thing still TBD is the plugin distribution method.

Worth noting that the larger "we" controls all 3 actors involved here (podman machine, bcvk, and the initramfs inside the guest).

> Bundle in the bcvk RPM. Could be included at podman machine image build time, then bind-mounted into the nbdkit container with -v.

There's no RPM on macOS, and we don't require bcvk installed on podman machine today (it would drag the virt stack into the podman machine host OS, among other things).

I think the thing that would keep the complexity here bundled inside bcvk (ignoring cross-architecture issues which would make this all way more complex) is to bundle the shared library inside our binary, and then use the Podman-machine connection to dynamically inject it into the target VM.

Basically then we can change what happens here at any point just by changing bcvk, no dependency on updates to podman machine.

Sorry for the delay on this update. I prototyped vsock and ran benchmarks — here are the key results. The user-facing performance difference is negligible for typical workloads, so the choice is more of an architectural decision.

Benchmark results (M1 Max, vfkit VM, dd 1GB):

TCP via gvproxy: 938–1031 MB/s

vsock via libkrun: 575–605 MB/s

This may seem counterintuitive, but TCP runs at the host kernel level via gvproxy, while vsock goes through libkrun's userspace muxer with more hops (4 hops vs 2 for TCP), which explains the gap.

TCP was the original choice for this PR since it works with stock Podman. After the PoC, TCP remains the recommended approach — vsock would also require upstream features that don't exist yet:

containers/podman: vsock port forwarding support in the machine provider

containers/krunkit: connect mode for vsock socket creation

If there's interest in revisiting vsock in the future, the PoC code is preserved in the wip/macos-vfkit-vsock branch.

> Hmm, but for ephemeral we don't need a GPT partition table (or ESP), we just need any mountable filesystem. We should be doing a direct kernel boot.

Direct kernel boot on macOS has a consideration: since vfkit runs on the host, kernel and initramfs need to be extracted from the container image (inside podman machine) and written to a shared path (/private/tmp via virtiofs) so vfkit can access them.

This is a tradeoff: the current design generates everything dynamically via nbdkit with no file extraction — the EROFS plugin computes responses on demand from the overlay directory. Direct kernel boot would require writing vmlinuz + initramfs to the host filesystem before launch (cacheable by image digest, so only first-run cost).

> I think the thing that would keep the complexity here bundled inside bcvk (ignoring cross-architecture issues which would make this all way more complex) is to bundle the shared library inside our binary, and then use the Podman-machine connection to dynamically inject it into the target VM.

Implemented. The .so is embedded in the bcvk binary via include_bytes!, and on the first ephemeral run bcvk automatically builds an nbdkit container image inside podman machine. No rpm-ostree install or manual .so deployment is needed.

The remaining challenge is the .so build itself. It's a nbdkit plugin shared library that can only be built on Linux. For CI, we'll need a Linux job to build the .so and make it available to the macOS/Windows build jobs. For local development, developers need either a Linux environment (e.g. podman run) or a cross-compile toolchain.

> The remaining challenge is the .so build itself. It's a nbdkit plugin shared library that can only be built on Linux.

cargo-zigbuild seems to be increasingly popular, it'd make sense to me to do that by default.

That said, we can obviously support/encourage a flow on mac/windows that uses Linux containers to build.

I've set up cargo-zigbuild for cross-building the plugin. Added make plugin-so-aarch64 and make plugin-so-x86_64 targets to switch between architectures. Updated the CI workflow accordingly. Tested locally and it passes.

cgwalters · 2026-05-19T13:18:03Z

I think we would need to backfill CI here.

One thing that may help significantly is for us to have an opt-in mode that simulates the proposed macOS architecture, but on Linux. That should be easy to do: we can have a flow that sets up podman machine and runs that way.

In fact, we could just make that a first class operation by default - detect if we're using podman machine on Linux and have things Just Work.

macOS has no KVM/QEMU, so this adds vfkit as the VM backend. Ephemeral VMs use direct kernel boot with SquashFS, persistent VMs use EFI boot. The vfkit/ module mirrors the libvirt/ directory structure, and CLI options match Linux where applicable.

Build and run on macOS:

cargo build --release
codesign -fs - target/release/bcvk

Tested on macOS (Apple Silicon) with rootful and rootless podman machine.

Assisted-by: Claude Code (Claude Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>

macOS has no KVM/QEMU, so this adds vfkit as the VM backend. Ephemeral VMs use a custom nbdkit EROFS plugin that dynamically generates rootfs, ESP, and GPT from the container overlay via NBD. Persistent VMs use EFI boot. The vfkit/ module mirrors the libvirt/ directory structure, and CLI options match Linux where applicable. Plugin distribution method is TBD.

Build and run on macOS:

cargo build --release
codesign -fs - target/release/bcvk

Tested on macOS (Apple Silicon) with rootful and rootless podman machine.

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>

Replace the last 2 instances of Command::new("kill") with rustix::process::kill_process in the --replace VM cleanup path. All macOS code now uses rustix for process signals, as requested by maintainer in PR bootc-dev#259 review.

Assisted-by: Claude Code (Claude Opus 4.6)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- to-disk with APFS clonefile-based base disk caching
- vm_helpers.rs shared with Windows (12 functions)
- nbdkit .so plugin auto-build via include_bytes! embedding
- CLI options unified with Linux/Windows (--ssh, --ssh-wait, --force, --stop, --install-log, --label, --format, --itype)

Assisted-by: Claude Code (Claude Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>

tnk4on · 2026-06-04T20:33:00Z

> I think we would need to backfill CI here.
>
> One thing that may help significantly is for us to have an opt-in mode that simulates the proposed MacOS architecture, but on Linux - that should be easy to do, we can have a flow that sets up podman machine and runs that way.
>
> In fact, we could just make that a first class operation by default - detect if we're using podman machine on Linux and have things Just Work.

Agreed, CI is needed. Simulating the macOS architecture on Linux via podman machine is an interesting approach. The podman-machine interaction (nbdkit, to-disk, SSH) should work on Linux as-is, though it would need a QEMU hypervisor backend to replace the vfkit layer.

cgwalters · 2026-06-04T22:02:33Z

> though it would need a QEMU hypervisor backend to replace the vfkit layer.

I don't see any issues with that, do you?

tnk4on · 2026-06-04T22:21:21Z

> > though it would need a QEMU hypervisor backend to replace the vfkit layer.
>
> I don't see any issues with that, do you?

No, I don't see any issues either. Let's go with that approach.

cgwalters · 2026-06-04T22:35:09Z

+          ExecStart=/bin/sh -c 'mkdir -p /run/systemd/system/sysinit.target.wants && cp /usr/lib/systemd/system/bcvk-journal-stream.service /run/systemd/system/ && ln -s ../bcvk-journal-stream.service /run/systemd/system/sysinit.target.wants/'\n",
+    );
+
+    write_file(


This stuff needs to be deduplicated

Done. Three shared units now reference kit/src/units/ via include_bytes!. journal-stream remains inline since the output device differs between Linux and macOS/Windows.

cgwalters · 2026-06-04T22:37:18Z

+    Ok(result)
+}
+
+fn walk_recursive(


I prefer using cap-std for stuff like this

Done. dir_walk.rs now uses cap_std::fs::Dir for traversal. Unix metadata (mode/uid/gid/mtime/nlink) is obtained via rustix::fs::statat since cap-std's Metadata doesn't expose those. Symlink targets use rustix::fs::readlinkat because container overlays contain absolute symlinks that cap-std rejects as outside the boundary.

cgwalters · 2026-06-04T22:38:33Z

+use crate::regions::{Region, RegionType};
+use std::sync::Arc;
+
+const EROFS_MAGIC: u32 = 0xE0F5E1E2;


I think I mentioned this before but we also maintain a huge amount of EROFS Rust code in https://github.com/composefs/composefs-rs/tree/main/crates/composefs/src/erofs which is today specialized for composefs, but we could probably at least try to lift/share some of the code.

Maybe we could factor it out into a crate.

That said there is also https://lib.rs/crates/erofs-rs which is actively developed it seems, but I have not looked closely at it.

I looked into both composefs-rs and erofs-rs. The on-disk format constants and struct definitions (superblock, inode, dirent) are clearly duplicated across projects. If composefs-rs factored out its format definitions into a standalone crate, bcvk could use that instead of its own. The build logic itself can't be shared — bcvk generates EROFS on demand via NBD pread rather than writing to a file.
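For readers unfamiliar with the on-demand approach, the sketch below illustrates the general idea of a regions table: each read is routed to whichever region covers the requested offsets, and the bytes are generated at that moment rather than stored. The `Kind` and `Region` types and the `pread` signature here are hypothetical stand-ins, not the plugin's actual `Region`/`RegionType` definitions.

```rust
/// Hypothetical region kinds for a virtual disk layout.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Gpt,
    Esp,
    Erofs,
}

/// A contiguous byte range of the virtual disk.
pub struct Region {
    pub start: u64,
    pub len: u64,
    pub kind: Kind,
}

/// Serve a read at `offset` by asking every overlapping region to generate
/// its part of `buf`; bytes not covered by any region read as zeros.
/// `generate` receives the region kind, the offset relative to the region
/// start, and the destination slice to fill.
pub fn pread(
    regions: &[Region],
    buf: &mut [u8],
    offset: u64,
    generate: impl Fn(Kind, u64, &mut [u8]),
) {
    buf.fill(0);
    let end = offset + buf.len() as u64;
    for r in regions {
        // Intersection of the request with this region.
        let lo = offset.max(r.start);
        let hi = end.min(r.start + r.len);
        if lo >= hi {
            continue;
        }
        let dst = &mut buf[(lo - offset) as usize..(hi - offset) as usize];
        generate(r.kind, lo - r.start, dst);
    }
}
```

Nothing is written to a file in this model, which is why file-oriented EROFS builders are hard to reuse directly even when the format definitions could be shared.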

- Cross-build .so via cargo-zigbuild (make plugin-so-aarch64/x86_64)
- Deduplicate initramfs units with include_bytes! from shared units/
- Use cap-std for directory walking in nbdkit-erofs-plugin
- Add per-architecture plugin-so Makefile targets

Assisted-by: Claude Code (Claude Opus 4.6)
Signed-off-by: Shion Tanaka <shtanaka@redhat.com>

cgwalters · 2026-06-05T19:45:33Z

+         if podman image exists {image}; then exit 0; fi; \
+         mkdir -p /var/tmp/bcvk; \
+         printf '%s' '{b64}' | base64 -d > /var/tmp/bcvk/plugin.so; \
+         printf 'FROM quay.io/fedora/fedora:latest\\nRUN dnf install -y nbdkit nbdkit-basic-plugins && dnf clean all\\nCOPY plugin.so /plugin.so\\n' | \


Hmmm. So this won't ever get updated on an existing machine unless someone prunes the image.

That's probably ok for a PoC, but I suspect it'll bite us down the line.

Obviously, we could ship a pre-built version of the container image from upstream...but then we don't need the .so baked into the binary.

There is a bigger path - we could use https://github.com/vi/rust-nbd (hmm looks like it could use some revitalization).

I mean, we're generating so much code here that I think the NBD server implementation isn't like that much more - and when we do that we can have a single executable binary (not a container image) that we directly run on the target host as a systemd unit?

Edit: Also when we go that route, we don't need to deal with the C interface stuff.
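On the staleness concern specifically, one possible short-term mitigation (not something proposed in this thread) is to derive the image tag from the embedded plugin bytes, so that a bcvk build with a different .so fails the `podman image exists` check and rebuilds. The sketch below uses a hand-written FNV-1a hash to stay dependency-free; the function names and the repository name in the usage note are assumptions.

```rust
/// FNV-1a (64-bit): a small non-cryptographic hash, used here only to get
/// a stable identifier for a given plugin binary.
pub fn fnv1a64(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in data {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Hypothetical helper: build an image reference whose tag changes whenever
/// the embedded plugin bytes change.
pub fn plugin_image_ref(repo: &str, plugin: &[u8]) -> String {
    format!("{repo}:{:016x}", fnv1a64(plugin))
}
```

For example, `plugin_image_ref("localhost/bcvk-nbdkit", PLUGIN_SO)` would yield a different reference after every plugin change, at the cost of old images accumulating until pruned. It does not address the base image or nbdkit packages going stale.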

gemini-code-assist Bot reviewed May 8, 2026

tnk4on force-pushed the wip/macos-vfkit-pr branch from b759a37 to d8a0f71 Compare May 8, 2026 08:33

cgwalters reviewed May 8, 2026

cfergeau mentioned this pull request May 11, 2026

Add code to uncompress arm64 kernels? crc-org/vfkit#451

Open

tnk4on force-pushed the wip/macos-vfkit-pr branch 4 times, most recently from 967611a to 4a1c1dd Compare May 11, 2026 16:51

tnk4on mentioned this pull request May 26, 2026

Add support for Windows #162

Open

1 task

tnk4on added 2 commits June 2, 2026 04:26

tnk4on force-pushed the wip/macos-vfkit-pr branch from 4a1c1dd to b1c2573 Compare June 1, 2026 19:28

cgwalters reviewed Jun 4, 2026

tnk4on force-pushed the wip/macos-vfkit-pr branch from ee93f4f to 7477a1d Compare June 5, 2026 14:32

cgwalters reviewed Jun 5, 2026
